Abstract: ChEMBL, DrugBank, Human Metabolome Database and the
Therapeutic Target Database are resources of curated
chemistry-to-protein relationships widely used in the chemogenomic
arena. In this work we have extended an earlier analysis (PMID 22821596)
by comparing chemistry and protein target content between 2010 and 2013.
For the former, details are presented for overlaps and differences,
statistics of stereochemistry as well as stereo representation and MW
profiles between the four databases. For 2013 our results indicate
quality improvements, major expansion, increased achiral structures and
changes in MW distributions. An orthogonal comparison of chemical
content with different sources inside PubChem highlights further
interpretable differences. Expansion of protein content by UniProt IDs
is also recorded for 2013 and Gene Ontology comparisons for human-only
sets indicate differences. These emphasise the expanding complementarity
of chemistry-to-protein relationships between sources, although
different criteria are used for their capture.
1 Introduction

Databases that include explicit mappings between proteins 
and the small-molecules that interact with them as bioac- 
tivity modulators offer expanding opportunities in chemo- 
genomics and pharmacological informatics. However, their 
proliferation also presents challenges. One of these is to 
discern incremental utility of individual resources and their 
combinations in various portals, for particular tasks. The
interpretation of integrated results needs an understanding
of each database from which they are extracted.[1] This is
essential to judge between the inevitable noise and dis- 
cordance in merged entities or result relationships. In addi- 
tion, the reassurance engendered by apparent independent 
concordance can be confounded by the increasing circulari- 
ty of data records (i.e. re-cycling of the same primary data 
between databases). 

The key to assessing utility is to compare databases in 
detail and thereby acquire an understanding of the differ- 
ent rules by which they have been populated. This work 
outlines ways of approaching this by using four well-established
and high-value databases: ChEMBL,[2] DrugBank,[3]
Human Metabolome Database (HMDB),[4] and the Therapeutic
Target Database (TTD).[5] We undertook a study of
these four databases in 2010, although this was not
published until 2012.[6] This new work extends our earlier study
in two main ways. Firstly, all four resources have undergone 
major updates. We can thus now gain unique insights from 
comparing snapshots taken approximately four years apart. 
Secondly, developments such as wider adoption of the InChI, the
inclusion of all four sources in PubChem, new cross-references in the
UniProt database and additional cheminformatic options, have allowed us
to expand the scope of the 2013 analysis. Since 2010 new methods for
indexing molecules have been described, including an extended version of
the Morgan algorithm, and compared with existing ones.[7] However, in
the interests of comparative consistency between our two studies we have
retained the main features of our previous analysis pipeline. Additional
context to this work is provided by new publications that have since
appeared from each database. Notwithstanding, brief summaries are
provided below, along with self-reported entity counts from the release
versions used in this work.

- ChEMBL data is mainly curated from journals covering a significant
fraction of global medicinal chemistry reports and
structure-activity-relationship (SAR) results. Release 15 (January 2013)
specifies on the website: 9570 targets, 1 254 575 distinct compounds,
10 509 572 activities and 48 735 publications (n.b. release 16 appeared
as this work was being finalised).

- DrugBank collates target and mechanism-of-action infor- 
mation. Version 3.0 (January 2011) contains 6715 drug 
entries including 1452 FDA-approved small molecules, 
131 biologicals, 86 nutraceuticals and 5076 experimental 
compounds. These are mapped to 4233 protein IDs. Half 
the detailed information in the records is devoted to the 
drug, the other half to sequences, pharmacological prop- 
erties, pharmacogenomic data, food-drug interactions, 
drug-drug interactions and experimental ADME data. 

- HMDB collates detailed chemical, clinical and biochemi- 
cal data on human metabolites. These are linked to 
other databases including enzymes involved in the trans- 
formations. Version 3.0 (September 2012) contains 40437 
chemical entries and 5650 protein sequence identifiers. 
Because they have both been developed at the same in- 
stitution, linkages are provided between DrugBank and 
HMDB at the compound, protein and pathway levels. In 
May 2013, HMDB switched their version number to 3.5, 
but without major changes in the data. 

- TTD is conceptually similar to DrugBank but the com- 
pound-to-target mappings are focussed on primary tar- 
gets. Another difference is the three-way split of targets 
and compounds into marketed, clinical trial and research 
phase. The latest version 4.3.02 (August 2011) includes 
2025 targets, 17816 chemical structures, including 1540 
approved drugs. 



2 Methods 

Our analysis is divided between the two main themes of chemistry and
proteins. The following section outlines the basic steps and how these
were enhanced in 2013, but for a full overview we recommend consulting
our previous publication.[6]

2.1 Chemistry Comparison 

For the extrinsic analysis of chemistry we compared the sets available
for download as SD files from each website in September 2010 with those
available in January 2013. The questions we wanted to answer were how
the set of chemical structures and the number of unique structures in
each of the four databases have grown, how much the older version (2010)
overlaps with the newer one (2013), and how the structural overlap
between the databases has changed. Table 1 provides an overview of
versions, structure record counts of the original files and a comparison
to the number of current Substance records in PubChem (generated by the
PubChem query "Database Name"[SourceName]). For all databases we found
small variations in record counts between the downloadable 2013 SD
files, the Substance (SID) count in PubChem and the structure counts
stated on the database websites. In cases where these discrepancies were
large we sought to provide an explanation.

In comparison to 2010, ChEMBL has more than doubled in the latest
release (version 15, January 2013). The SID count for ChEMBL in PubChem
is about 450 000 records smaller than the direct download from ChEMBL
(see PubChem comparison section below). DrugBank has grown ~30% between
version 2.0 and version 3.0, published in 2011. For HMDB, we used
version 2.5 in our 2010 study. The SD file from September 2010 contained
7888 records (although some failed processing) but when 2.5 was
re-downloaded in early 2013 we found 8553 structure records. The latest
HMDB (3.0) has grown to ~40 000 records in the download file. TTD
increased between 2010 and 2013 to ~15 000 structure records and has a
similar count of Substances in PubChem (14 771 SIDs).
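The growth figures quoted above can be reproduced from the declared record counts in Table 1. The following is a minimal sketch only; the function name and the dictionary layout are our own illustrative choices and are not part of the original analysis pipeline.

```python
def growth_percent(old_count, new_count):
    """Relative growth of a record count, as a percentage of the old count."""
    if old_count <= 0:
        raise ValueError("old_count must be positive")
    return 100.0 * (new_count - old_count) / old_count


# Declared source record counts from Table 1: (2010 snapshot, 2013 snapshot).
declared_counts = {
    "ChEMBL": (600625, 1251913),
    "DrugBank": (4886, 6516),
    "HMDB": (7888, 40209),
    "TTD": (3616, 15009),
}

for name, (old, new) in declared_counts.items():
    # Print both the fold change and the percentage growth.
    print(f"{name}: {new / old:.2f}-fold, {growth_percent(old, new):+.0f}%")
```

Note that these values describe declared record counts, not unique structures, so they only approximate the true expansion of chemical content.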

Table 1. Versions and file downloads.

Database   Year   Version   Release date   Declared source   Substance (SID) count
                                           record count      in PubChem (May 2013)
ChEMBL     2010   6         2010-09-02       600 625         804 093
           2013   15        2013-01-30     1 251 913
DrugBank   2010   2.0       2008-01-31          4886         6683
           2013   3.0       2011-01-31          6516
HMDB       2010   2.5       2010-09-19          7888         8550
           2013   3.0       2012-09-15        40 209
TTD        2010             2010-09-19          3616         14 771
           2013             2011-08-25        15 009

The content of all original database SD files was processed with the
cheminformatics toolkit CACTVS.[8] The structure records were normalised
to our standards on the basis of the rule sets implemented for the
NCI/CADD identifiers FICTS, FICuS and uuuuu.[9] These differ in their
normalisation modes, i.e. they have varying levels of sensitivity to
certain molecular and atomic features. They thus have different scopes
of what is regarded as a chemically unique structure. The first step of
normalisation addresses a unique representation of stereochemistry,
charged resonance structures, misdrawn functional groups, undefined
hydrogen atoms, undefined charges, and incorrect valences. From this
level, a numeric hash code representation is generated that uniquely
represents the normalised structure and establishes the FICTS identifier
of the input structure. In addition, for the FICuS identifier, a
canonical tautomeric representation is created before the hash code is
calculated. This means the FICuS identifier is able to link different
tautomeric forms of the same chemical compound as they occur in
different or even the same source databases. Finally, the uuuuu
identifier also disregards counter ions, stereochemistry, isotopic
labelling, and formal charges in comparison to the FICuS identifier. It
is thus useful to find related forms of the same chemical compound which
share the same basic connectivity and skeleton, irrespective of
tautomers, stereoisomers, salt forms or charged species.

For comparisons, IUPAC International Chemical Identifier (InChI)
Standard InChIKeys (version 1.04) were calculated for all original
structure records.[10] For this analysis, we also calculated a second
set of InChIKeys for which the new InChI flags "KET" and "T13" were
switched on. In the latest version of the InChI library these allow for
a stricter handling of tautomerism compared to Standard InChIKey and may
be incorporated into a forthcoming version. All chemical identifiers
were stored and organized in a MySQL database for further analysis. We
also maintained the original record IDs of the source database (e.g.
DBxxxxx, HMDBxxxxx and CHEMBLxxxxx) including the release designations
used in Table 1. For the set of normalized (unique) structures in the
database we used CACTVS to calculate molecular weights and comparative
statistics related to the quality of stereo information.
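As a rough illustration of how identifiers of different stringency group records, the sketch below clusters records by the first 14-character block of their InChIKey, which is a hash of the core constitution layers (formula and connectivity), while stereochemistry and isotopic information are hashed into the second block. This is only loosely analogous to the uuuuu identifier: counter ions, for example, change the first block. The record IDs and keys are made-up placeholders with the correct shape, not looked-up values, and the function names are our own.

```python
import re
from collections import defaultdict

# Shape of an InChIKey: 14 letters, 8 letters + flag (S/N) + version letter,
# and a final protonation letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{8}[SN][A-Z]-[A-Z]$")


def connectivity_block(inchikey):
    """Return the first (connectivity) block of a well-formed InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return key.split("-")[0]


def group_by_connectivity(records):
    """Group (record_id, inchikey) pairs by their connectivity block."""
    groups = defaultdict(list)
    for record_id, key in records:
        groups[connectivity_block(key)].append(record_id)
    return dict(groups)


# Placeholder records: rec-1 and rec-2 differ only in the second block,
# as two stereoisomers of one skeleton would.
example_records = [
    ("rec-1", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("rec-2", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("rec-3", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]
example_groups = group_by_connectivity(example_records)
print(len(example_records), "records,", len(example_groups), "skeleton groups")
```

Counting groups instead of full keys gives a lower "unique structure" count, which is the same effect seen when moving from stricter to looser identifiers.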

Unlike 2010, when TTD and HMDB were not yet present, all four databases
are now specifically selectable as submission sources in PubChem. There
are two caveats. Firstly, completion of the submission of all HMDB
structures is still pending because of curation updates.[11] The second
is that, as explained in the ChEMBL v.10 release notes of June 2011,
ChEMBL substantially increased the number of compounds by including
PubChem confirmatory BioAssays with dose-response endpoints (e.g., IC50,
Ki, or potency). As a consequence, while ChEMBL v.15 declares 1 254 575
structures for the direct download, the PubChem query
"ChEMBL"[SourceName] retrieves 804 093 CIDs. Thus, approximately 450 000
structures from PubChem were imported into ChEMBL v.15. Notwithstanding
these caveats, we took advantage of the PubChem toolbox to perform
various comparisons. These are not only different in the information
they provide, but are also complementary to the analysis of direct
downloads. The filters used for the intersections were a combination of
the default PubChem settings (seen on the lower left of any query result
page) and three set up for this work. One which needs some explanation
is patent occurrence. This was the union (in size order) of
SureChemOpen, Thomson Pharma, SCRIPDB and IBM as submitting PubChem
sources of patent-extracted structures.

2.2 Proteins 

Comparisons of protein content were performed using lists of UniProt
protein identifiers derived from each source. For TTD these were parsed
from a text dump of the records.[12] For DrugBank the external database
and ID links for drug targets were retrieved from the download
interface.[13] Since HMDB was undergoing a site upgrade the equivalent
UniProt IDs were obtained directly (courtesy of Dr Craig Knox). Because
ChEMBL now has direct links from UniProt the appropriate query was used
to retrieve the ID list.[14] These links were first instigated for
ChEMBL v.14 in November 2012 and may not have yet been synched to ChEMBL
v.15. However, because of the convenience of being able to query and
intersect these entries directly from the UniProt interface, we have
used this for the 2013 protein set. The Venny web tool was used to
generate Venn diagrams.[15] The version numbers and dates are given in
Table 1 but for convenience we refer from this point on to the
historical and contemporary sets as "2010" and "2013", respectively.



3 Results for Chemistry 

Table 2 lists the number of unique structure records in the 
four databases. 

Because of their different input normalisation stringen- 
cies, each of the identifiers indicates an expected reduction 
in the number of unique structures compared to the origi- 
nal counts, with bracketed numbers giving the percentage 
of this value. The upper part of Table 2 is a repeat of our
2010 results but performed with enhanced 2013 processing.
These include InChI 1.04 instead of 1.03 and a newer
version of CACTVS (academic 3.410 version, February 2012).
This improved reading capabilities and comparability to 
processing of the most recent database releases. Neverthe- 
less, we recorded only minor variations between our 2010 
results and the repeated calculation in the upper part of 
Table 2. 

The comparison of the unique structure counts obtained by Standard
InChIKey vs. FICTS (Table 2) shows similar results for all four
databases and intermediate releases. This might seem unexpected as the
Standard InChIKey already includes some basic handling of tautomerism,
while the FICTS identifier does not normalise tautomers. However, we
have seen similar behaviour in other cases. For each of the two releases
of ChEMBL and HMDB we found only a small number of duplicates by
Standard InChIKey and FICTS (0.1% or 0.2%, respectively), while for
DrugBank the number of duplicates decreases from ~5% to ~2% between both
database versions. A dramatic reduction can be seen for the 2010 version
of TTD for which the number of unique structures is 20% lower than the
original structure record count. The change to ~6% in the 2013 version
indicates an improvement in the quality of chemical structures in TTD.

Table 2. Unique structure counts. These were determined by Standard
InChIKey, FICTS, InChIKey with the tautomer flags "T13" and "KET"
switched on, FICuS, and uuuuu. They are presented for the "2010" and
"2013" release of each database, together with, in brackets, the
percentage of the original structure counts in Table 1.

Database   Version      Standard           FICTS (%)          Tauto.             FICuS (%)          uuuuu (%)
(2010)                  InChIKey (%)                          InChIKey (%)
ChEMBL     6            599 879 (99.8)     599 862 (99.8)     592 419 (98.6)     598 625 (99.6)     558 260 (92.9)
DrugBank   2.0          4674 (95.6)        4675 (95.7)        4666 (95.5)        4665 (95.5)        4529 (92.6)
HMDB       2.5          7877 (99.8)        7877 (99.8)        7849 (99.5)        7851 (99.5)        7488 (94.9)
TTD        2010-09-19   2834 (78.3)        2857 (79.0)        2826 (78.1)        2833 (78.3)        2574 (71.1)

Database   Version      Standard           FICTS (%)          Tauto.             FICuS (%)          uuuuu (%)
(2013)                  InChIKey (%)                          InChIKey (%)
ChEMBL     15           1 251 500 (99.9)   1 251 500 (99.9)   1 240 676 (99.1)   1 247 717 (99.7)   1 127 013 (90.1)
DrugBank   3.0          6379 (97.9)        6404 (98.2)        6362 (97.6)        6395 (98.1)        6198 (95.1)
HMDB       3.0          40 154 (99.8)      40 196 (99.9)      40 108 (99.7)      40 094 (99.7)      39 131 (97.3)
TTD        2012-10-01   14 111 (94.0)      14 152 (94.2)      13 974 (93.1)      14 107 (93.9)      13 306 (88.6)
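The duplicate levels discussed for Table 2 are simply the shortfall of the unique count relative to the original record count. A minimal sketch of that arithmetic is given below, using the TTD counts from Tables 1 and 2; the function name is our own, and the rounding may differ slightly from the bracketed percentages in the table.

```python
def duplicate_percent(record_count, unique_count):
    """Percentage of records that collapse onto an already-seen identifier."""
    if record_count <= 0 or not 0 <= unique_count <= record_count:
        raise ValueError("counts must satisfy 0 <= unique <= records, records > 0")
    return 100.0 * (record_count - unique_count) / record_count


# TTD: declared record counts (Table 1) vs. unique Standard InChIKeys (Table 2).
ttd_2010 = duplicate_percent(3616, 2834)
ttd_2013 = duplicate_percent(15009, 14111)
print(f"TTD 2010: {ttd_2010:.1f}% duplicates, TTD 2013: {ttd_2013:.1f}% duplicates")
```

The same calculation applies to any identifier column, so a lower percentage under a looser identifier reflects merged stereoisomers, tautomers or salt forms and not necessarily redundant curation.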

If a stricter handling of tautomerism is performed via the FICuS
identifier and the second set of calculated InChIKeys (i.e. switching on
the InChI tautomer flags "T13" and "KET"), the unique structure counts
reduce further. However, the effect is small if compared to the numbers
of FICTS and Standard InChIKey. Changes and improvements between 2010
and 2013 are very similar for all databases and their releases.

The uuuuu identifier offers a diversity assessment of basic connectivity
(i.e. it disregards the "diversity" created by different stereoisomers,
charged forms, salts and tautomeric forms). It also estimates the number
of unique chemical skeletons (including bond orders). Thus, for ChEMBL,
DrugBank, and HMDB, the uuuuu counts of unique structures are between
~6% and ~12% lower than the original structure counts, whereas the 2010
version of TTD dropped by ~30%.

The impression given by Table 2 is that the database 
teams have reduced duplicates and improved their han- 
dling of different tautomeric forms of the same canonical 
compound. Note that from the data per se if a database 
both expands and improves on average, we cannot dis- 
criminate between remediation of existing structures, 
marked improvements of just the new ones, or both. 

Our discussion of content overlap will be restricted mainly to Standard
InChIKey since this is the most established identifier. However,
analysis by uuuuu also reveals interesting aspects. Figure 1 illustrates
the individual changes between 2010 and 2013. The intersections of the
Venn diagrams give the number of structures that have been maintained in
both versions of each database, while the counts outside the
intersections show the number of removed and newly added unique
structures, respectively.
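The three regions of each two-release Venn diagram correspond to simple set operations on the identifier lists. The sketch below assumes each release is available as a collection of identifier strings (for example Standard InChIKeys); the function name and the placeholder keys are illustrative only.

```python
def compare_releases(old_ids, new_ids):
    """Split identifiers into retained, removed and newly added sets."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        "retained": old_set & new_set,  # present in both releases
        "removed": old_set - new_set,   # only in the older release
        "added": new_set - old_set,     # only in the newer release
    }


# Placeholder identifiers standing in for the 2010 and 2013 key lists.
release_2010 = ["key-a", "key-b", "key-c"]
release_2013 = ["key-b", "key-c", "key-d", "key-e"]

regions = compare_releases(release_2010, release_2013)
for region, keys in regions.items():
    print(region, len(keys))
```

Running the same comparison with a looser identifier moves records from "removed" and "added" into "retained" whenever a structure was only corrected, not replaced.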

The much smaller number of unique structures reported by the uuuuu
identifier can be explained in most cases by the different chemical
scopes of the uuuuu vs. Standard InChIKey identifiers. For instance, for
the 2010 release of ChEMBL we record 2669 unique structures by uuuuu
compared to 25 588 uniques by Standard InChIKey (Figure 1). This occurs
mainly because of two effects. Firstly, the 25 588 unique structures
found by Standard InChIKey possess a low diversity. This means they form
a much smaller set of unique structures when compared on the basis of
their basic connectivity, since this is what the uuuuu identifier is
intended to do by disregarding stereoisomers, counterions, etc.
Secondly, there are other cases where the uuuuu identifier subsumes
structure records from the set exclusive to a single release into the
intersection of both releases. This occurs for records where the basic
connectivity does not change between old and new database releases but
improvements have been incorporated on some level in the newer release
(e.g. adding or correcting stereochemistry). For these, the linkage
between "original" and "improved" structure can only be established by
disregarding the improvements. This is particularly the case for the
intersection for HMDB in Figure 1 where the count of unique structures
by uuuuu is larger than by Standard InChIKey (4051 vs. 2360). Thus,
these numbers also imply improved curation.

Table 3 shows the pairwise overlaps between the databases using Standard
InChIKey for the database versions analysed in 2010 (upper part) and
2013 (lower part).

Figure 1. Venn comparisons for the 2010 and 2013 versions. The main
numbers are overlap by Standard InChIKey, those underneath in italics
and brackets are by uuuuu.



Unsurprisingly, since they all expanded, the overlaps between databases
have increased in absolute numbers (the numbers in the main diagonal
indicate the number of unique records by Standard InChIKey). For 2013,
ChEMBL now covers substantially larger parts of DrugBank (up from 37% to
55%) and TTD (up from 57% to 91%), although TTD has itself grown in
absolute numbers from 2834 to 14 111 structure records (unique by
Standard InChIKey).
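The coverage percentages follow from dividing a pairwise overlap by the unique count of the database being covered. The sketch below shows one way to build such an overlap matrix from sets of identifiers and then reproduces the ChEMBL coverage of TTD from the counts in Table 3. The function names and the toy key sets are our own assumptions, not the original MySQL-based workflow.

```python
from itertools import combinations


def pairwise_overlaps(key_sets):
    """Return {(name_a, name_b): shared_count} for every pair of databases."""
    return {
        (a, b): len(key_sets[a] & key_sets[b])
        for a, b in combinations(sorted(key_sets), 2)
    }


def coverage_percent(shared, unique_in_target):
    """Share of the target database that is also found in the other one."""
    return 100.0 * shared / unique_in_target


# Toy identifier sets, only to show the shape of the calculation.
toy_sets = {"A": {"k1", "k2", "k3"}, "B": {"k2", "k3", "k4"}, "C": {"k3"}}
toy_overlaps = pairwise_overlaps(toy_sets)

# ChEMBL coverage of TTD, using the overlap and diagonal counts of Table 3.
ttd_cover_2010 = coverage_percent(1622, 2834)
ttd_cover_2013 = coverage_percent(12893, 14111)
print(f"TTD covered by ChEMBL: {ttd_cover_2010:.0f}% -> {ttd_cover_2013:.0f}%")
```

Because coverage is normalised to the smaller database, it can rise sharply even when both databases are growing, as is the case for TTD here.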

Figure 2 confirms this. The number of exclusive structures in TTD
increases only moderately. However, the exclusive, mutual overlap with
ChEMBL increases substantially (703 to 11 398 unique structures). The
new structures in HMDB seem to be largely unique content (see Table 3
and Figure 2), as expected considering HMDB's focus on metabolites.
DrugBank's unique content does not change substantially, but has a much
bigger overlap with ChEMBL. Given that ChEMBL is by far the largest
database (and effectively doubled to 1.2 million), it is surprising
neither that most of its content is unique relative to the other three
databases nor that the overlaps concomitantly increase.

The set of structures in common (the central four-way intersection in
the Venn diagrams of Figure 2) increased from 115 to 1047 structures
between 2010 and 2013. In 2010, we were surprised to record such low
intersects since three of the databases should have included the same
set of FDA-approved drugs. However, similar low intersects have been
noted in an earlier comparison of drug databases.[1] In 2013, the
four-way intersection (1047 structures unique by Standard InChIKey and
1270 by uuuuu) is closer to what three of the databases report as the
numbers of approved drugs in their structure sets (TTD: 1540, DrugBank:
1424, and ChEMBL: 1214) but it should be noted that, via the inclusion
of HMDB, this includes both drugs and metabolites.

A distribution of stereochemistry and stereo representation is provided
in Table 4. From our experience, all other factors being equal, these
parameters are indirect indicators of improved structure quality because
the correct representation of stereochemistry requires careful handling.
However, the absence of stereo, or even seemingly incorrect
representations, may accurately represent what was specified in the
extracted source, typically as an image or an IUPAC name (e.g. different
journal papers referring to the same canonical structure). These
statistics cannot therefore be used to assess curatorial accuracy.

Table 3. Overlap matrix for the four databases by Standard InChIKey.

(2010)     ChEMBL      DrugBank   HMDB     TTD
ChEMBL     599 879     1746       901      1622
DrugBank               4674       362      1220
HMDB                              7877     163
TTD                                        2834

(2013)     ChEMBL      DrugBank   HMDB     TTD
ChEMBL     1 251 500   3335       4254     12 893
DrugBank               6379       1746     1346
HMDB                              40 154   1306
TTD                                        14 111

Table 4 lists the number of structures (plus the percentage of the
original record count) without chiral atom centres, with fully specified
stereo configuration on all atoms, and with undefined stereo
configuration on at least one chiral atom. The situation for bond
stereochemistry is given by the following metrics: a) the number of
structures with no stereogenic double bonds, b) those that have
correctly specified stereo configuration on all double bonds, and c)
those with at least one double bond for which stereo information is
missing.
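The three categories used for atoms (and, analogously, for double bonds) can be expressed as a simple classification over two counts per structure. The sketch below assumes those counts have already been obtained from a cheminformatics toolkit (CACTVS in this work); the function name, category labels and example tuples are hypothetical.

```python
from collections import Counter


def stereo_category(n_centres, n_specified):
    """Classify one structure by how completely its stereo centres are defined."""
    if n_centres < 0 or not 0 <= n_specified <= n_centres:
        raise ValueError("need 0 <= n_specified <= n_centres")
    if n_centres == 0:
        return "no stereo centres"
    if n_specified == n_centres:
        return "full specification"
    return "unspecified"  # at least one centre lacks a configuration


# Hypothetical (stereo centres, specified centres) pairs for five structures.
example_structures = [(0, 0), (2, 2), (3, 1), (1, 0), (0, 0)]

tally = Counter(stereo_category(c, s) for c, s in example_structures)
for category, count in tally.items():
    share = 100.0 * count / len(example_structures)
    print(f"{category}: {count} ({share:.1f}%)")
```

Applying the same function to counts of stereogenic double bonds yields the bond stereo columns.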

Notably, the percentage of achiral structures has increased in all four
databases between 2010 and 2013. We suggest this is due to expansion
beyond drug-like compounds. The absolute numbers with full stereo
specification have also increased but, because of overall growth, the
relative numbers have fallen. Nevertheless, our stereochemistry results
also imply enhanced curation efforts. For example, in DrugBank the
proportion of structures with undefined atom stereo configuration
dropped from 52% in 2010 to 12% in 2013. This is also recorded in the
Venn diagram in Figure 1. For HMDB we recorded the opposite trend, where
unspecified stereo information jumps from ~25% to ~65%. The 33 000 new
structures in 2013 thus included many with undefined stereo information.
This may be associated with the increase in large lipids. HMDB has a
much lower percentage of structures that are achiral in both the 2010
and 2013 versions. For bond stereo information it is striking that ~90%
of structures in the three drug-focused databases have no stereogenic
double bonds. HMDB is the exception (37.5%) because of the focus on
metabolites. The remaining two columns regarding bond stereo information
in Table 4 are difficult to interpret because double bonds are often
arbitrarily drawn in E-configuration even though the actual stereo
configuration is unknown.

Table 4. Analysis of stereochemistry between the "2010" and "2013"
versions of the databases. The columns are (left to right) number of
structures with no chiral atom stereo centers, structures for which
stereo configuration of all atoms is specified, at least one stereo atom
center is unspecified, those without bond stereo centers, those for
which the configuration of all bond stereo centers is given, and those
where at least one bond stereo center is missing. The percentage of the
original structure count in Table 1 is given in brackets.

Database   No atom stereo   Full atom stereo    Unspecified atom    No bond stereo     Full bond stereo    Unspecified bond
(2010)     centres (%)      specification (%)   stereo spec. (%)    centres (%)        specification (%)   stereo spec. (%)
ChEMBL     284 553 (47.4)   148 848 (24.8)      165 944 (27.6)      530 812 (88.4)     69 632 (11.6)       176 (<0.1)
DrugBank   1550 (31.8)      772 (15.8)          2542 (52.1)         4480 (91.9)        360 (7.4)           36 (<0.1)
HMDB       1042 (13.2)      4763 (60.4)         2046 (25.9)         3051 (38.7)        4824 (61.2)         11 (<0.1)
TTD        1583 (43.8)      1136 (31.4)         868 (24.0)          3223 (89.1)        388 (10.7)          5 (<0.1)

Database   No atom stereo   Full atom stereo    Unspecified atom    No bond stereo     Full bond stereo    Unspecified bond
(2013)     centres (%)      specification (%)   stereo spec. (%)    centres (%)        specification (%)   stereo spec. (%)
ChEMBL     706 502 (56.4)   222 802 (17.8)      320 428 (25.6)      1 103 833 (88.2)   128 528 (10.3)      19 517 (1.6)
DrugBank   2622 (40.2)      2848 (43.7)         767 (11.8)          6018 (92.4)        469 (7.2)           29 (<0.1)
HMDB       7547 (18.8)      6161 (15.3)         26 453 (65.8)       15 098 (37.5)      24 964 (62.1)       147 (<0.1)
TTD        7826 (52.9)      4026 (27.2)         2886 (19.5)         13 471 (91.1)      1311 (8.9)          8 (<0.1)

Figure 3. Molecular weight distribution of the 2010 and 2013 versions of
ChEMBL, DrugBank, HMDB, and TTD. For 2013 the MW distribution of the
exclusive structure records (i.e. exclusive content in Figure 2: 2521
records for DrugBank, 6870 records for HMDB, 877 records for TTD) is
highlighted in black (for ChEMBL the grey and black distributions are
basically identical). The solid line in each plot indicates the median,
dashed lines represent Q1 and Q3, respectively. The statistics of these
distributions are shown in Table 5.

While profiling a series of chemical properties such as LogP and Polar
Surface Area (PSA) would be of interest, we had to restrict ourselves to
MW as the most comparatively informative. This highlighted distinct
differences (Figure 3 and Table 5).



Table 5. Statistics for MW distributions displayed in Figure 3. The
mean, median, standard deviation, lower (Q1) and upper (Q3) quartiles
are shown.

                Mean   Std. Dev   Q1    Median   Q3
ChEMBL 2010     456    297        324   405      503
ChEMBL 2013     424    245        319   387      466
DrugBank 2010   364    303        205   309      428
DrugBank 2013   345    198        228   321      412
HMDB 2010       662    404        340   702      831
HMDB 2013       725    408        346   812      956
TTD 2010        400    366        258   354      466
TTD 2013        437    407        255   320      446
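Summary statistics of the kind listed in Table 5 can be computed with the Python standard library alone. The sketch below is illustrative: the quartile convention used in the original analysis is not stated, so the "inclusive" method is an assumption, and the molecular weights are made-up values.

```python
import statistics


def mw_summary(weights):
    """Mean, sample standard deviation and quartiles of molecular weights."""
    if len(weights) < 2:
        raise ValueError("need at least two molecular weights")
    # quantiles() with n=4 returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
    return {
        "mean": statistics.mean(weights),
        "std_dev": statistics.stdev(weights),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


# Hypothetical molecular weights (Da) for a tiny structure set.
example_weights = [100, 200, 300, 400, 500]
summary = mw_summary(example_weights)
print({name: round(value, 1) for name, value in summary.items()})
```

For strongly multi-modal distributions, such as HMDB in 2013, these cumulative values hide the separate peaks, so they are best read alongside the histograms in Figure 3.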



Starting with ChEMBL we can see a continuous distribution with the
median ~400. The implication is of a more "lead-like" than "drug-like"
content.[16] This would fit with what might be expected from the
extraction of SAR from the medicinal chemistry literature where the
primary mode of activity testing is in vitro. The 2010 distribution
indicated the high-MW content had (proportionally) dropped slightly. The
much smaller DrugBank collection shows a discontinuous spiky
distribution but the median MW drops by ~100 Da into a more "drug-like"
zone compared to ChEMBL. While there is a hint of bimodality for 2010
this has smoothed out by 2013 with a slight rise in the median. This
fits with the inference that these are predominantly in vivo optimised
compounds and/or PDB ligands with concomitant lower average MW compared
to ChEMBL. The same effect would be predicted for TTD and this is
observed. However, the median is slightly higher from a larger
proportion of high-MW entries compared to DrugBank.

The cumulative statistics of HMDB are confounded because of the
pronounced tri-modal 2013 distribution. This is not only a major change
but also indicates two large new clusters around ~1000 and ~1500. These
are outside the envelope we might have expected for small-molecule
metabolites as seen in the 2010 pattern. Inspecting selected entries
under these peaks provided an explanation. For example, the first peak
included a long-chain monolignoceric acid triglyceride with a MW of 1027
(HMDB47321). Under the second peak we find a cardiolipin of MW 1526
(HMDB57781). As a corroborative cross-check, the search terms
"triglyceride" and "cardiolipin" retrieve 13 923 and 3277 results,
respectively, which approximately fits the peak size ratio and supports
a focus on large lipid capture for the latest update.

This chemistry comparison section concludes with the re- 
sults from comparing the databases inside PubChem where 
they are now instantiated as separate sources. We chose 11 
categories of content to display in Figure 4. We can com- 
pare the 11 intersects for each database according to the 
order in Figure 4. The numbers in brackets after each cate- 
gory are the total CID counts for that filter in PubChem for 
May 2013. 

1. Literature-linked (962 666). Because this is largely
ChEMBL-plus-PubMed the former is obviously recorded as 100%. The 60%
coverage in DrugBank indicates lower primary literature capture but note
these could be secondary sources such as review articles. It does raise
the question as to which structures are encompassed in the DrugBank 40%
without primary literature, but these may be PDB entries for which the
individual reports are not cited. At 30% HMDB has the lowest literature
coverage. An explanation is that the large number of lipid records can
be cross-referenced neither to ChEMBL journal articles nor to
alternative PubMed IDs linked to a CID for the structure. In contrast to
DrugBank, the proportion in TTD is up to 90%. This is corroborated by a
90% overlap at the structure level (Figure 3). The most likely
explanation here is that the recent TTD curation efforts have actively
selected ChEMBL entries as starting points and/or cross-links, whereas
these only reach 60% in DrugBank.

2. ROF+250-800 (31 812 051). This is a simple lead-like filter,
encompassing drug-likeness at the lower MW end. While not predictive per
se, this filter enriches for bioactive compounds (n.b. PubChem overall
is skewed upwards in relative proportion because 75% of all vendor
depositions are in this range). Note that Figure 3 assists in the
interpretation of the MW dimension. We can see that ChEMBL and DrugBank
are similar at 58% and 61%, respectively. Lowest in this filter is HMDB
at 10% but this is not unexpected considering many metabolites would
fall below MW 250 and the substantial lipid content would be excluded on
the basis of both LogP and MW. The data do not indicate any particular
reason why TTD (49%) should be lower than both ChEMBL and DrugBank.

Figure 4. CID matches for selected sources in PubChem. The panels are:
(a) ChEMBL, (b) DrugBank, (c) HMDB, and (d) TTD. The ranking of matches
in ChEMBL was taken as the reference order for the other three plots.

3. Active in PubChem Bioassay (883627). For ChEMBL the 
figure of 55% seems unexpectedly low since this source is 
the major contributor to BioAssay, as mostly positive activi- 
ty results of one sort or other extracted from the literature. 
However, ChEMBL data in PubChem does not get an 
active/inactive tag generated via thresholds in the same 
way as is done for the Molecular Libraries Screening Centre 
Network (MLSCN) result sets. The fact that this drops to 
48% for DrugBank where one might expect positive linked 
assay results for all drug candidates may be related to the 
same active flagging issue. The 9% level of actives in 
HMDB is actually higher than anticipated since metabolites 
are not expected to be inhibitors at concentrations typically 
tested in assays but this is partially explained by drug content. The fact that TTD (87%) ranks significantly above ChEMBL in this filter also supports the idea that curation was specifically picking up compounds with potency data from ChEMBL.

4. Patent Sources (15 039 047). The observation that ~50% of ChEMBL structures are in patent sources, compared to 13% overall in PubChem, can be explained by structures from the medicinal chemistry literature being first exemplified in patents. This rises to 67% and 71% for DrugBank and TTD respectively, reflecting higher (proportional) drug and clinical candidate content. While HMDB is low (26%), we would not expect metabolites to be claimed as structures. What may contribute to this figure are not only the drugs but also biochemical names in the dictionaries used for manual or automated patent extraction.

5. Source-Unique (25 806 124). This means the CID comes from only one submitter. At ~20% the unique content of ChEMBL is the highest in the set; however, it would be ~100 K CIDs higher still without the circularity arising from common chemical content between BindingDB and ChEMBL (i.e. the intersect between them represents 94% of the former). The interpretation here is that many of the structures extracted from papers by ChEMBL are (by CID rules) not hitherto represented in other PubChem sources. The caveats are not only that alternative stereo and/or other tautomeric representations may be present (i.e. single-source CIDs are not all canonically unique) but also that any of them may be "correct" in reflecting representations derived from different extracted papers. As an example, the CID for CHEMBL1797692 (CID: 56680063, RAERAPYSCWYQAO-AKIFATBCSA-N) has fully specified stereo but the InChIKey skeleton matches a "flat" supplier compound (CID: 71369165, RAERAPYSCWYQAO-UHFFFAOYSA-N). The unique content of DrugBank and TTD is very low. The implication is that their structures are independently corroborated by other submissions merged in the CID records (e.g. ChEMBL as indicated above). However, this would be confounded if a proportion of curation was circular (i.e. did not involve de novo SID generation). At 38% the unique content of HMDB is high but 90% of this has a MW above 800. Ranking these by MW and visual inspection indicates that the unique content is substantially derived from complex lipids. This corroborates the MW results, but note that ~75% of the HMDB structures are not yet captured in this PubChem analysis. Corroborative information was supplied by an intersect with LipidMaps at 3576 (i.e. 35% inside PubChem).

6. Disconnected structures of 2 or more components 
(1571436). This is a useful measurement for salts and mix- 
tures (note this cannot technically discriminate dimers or 
other multimers but inspection suggests the occurrence of 
these is low). Only ChEMBL has a significant count (6.5%) 
which includes salt forms of drugs specified in the literature 
and USAN approvals. 

7. NIH Molecular Libraries (397823). This is the physical 
collection shared between US screening centres. These 
structures can accumulate extensive cross-reactivity data 
depending on how long they have been in the collection. 
On a proportional basis the sources here do not have large 
intersects but ChEMBL is highest on an absolute basis. 
DrugBank is up to 25% which may be related to the high 
coverage of drugs and candidates. 

8. INN/USAN (10388). Using this as an "and/or" query with 
a restriction to the PubChem Compound synonym field re- 
turns both currently approved drugs and historical ad- 
vanced-stage candidates. It thus constitutes a comprehen- 
sive drug collection that is nominally independent (i.e. the 
above databases are not usually the first INN or USAN synonym-assigning sources). With a 72% intersect (with the total) ChEMBL has the largest coverage while both DrugBank and TTD are surprisingly low at 14% and 19% respectively. Possible explanations include differences in stereo and/or salt forms. The databases may also have a lag time
for newly approved drugs but there is no PubChem source 
from which these can be cleanly selected for comparison. 
The fact that HMDB includes a selection of drugs (229) is 
mentioned in their 2013 paper but note that some metabo- 
lites, vitamins and hormones also have INNs or USANs for 
pharmaceutical formulations. 

9. Pharmacological Actions (11 912). This important subset flags up where in vivo activity is assigned to a structure via MeSH curation of one or more PubMed IDs. It therefore indicates therapeutic testing of drug action in animal models and clinical trials with useful specificity. Here again ChEMBL scores high compared to DrugBank and TTD (at 51%, 13% and 18% respectively). Contributing factors here are the
much larger scale of ChEMBL and the fact that many of the 
research compounds captured in DrugBank and TTD do 
not have in vivo characterisation data. The low figure in 
HMDB would include the drug and hormone content. 

10. Protein 3D Structures (23 562). This category indicates CIDs identified within a protein structure. The inclusion of all hetero-atoms means the category extends beyond pocket-bound small-molecule ligands, but intersecting these with the filter above (ROF+250-800) indicates that ~7900 could be in this category. While ChEMBL ranks top in absolute numbers (5690), DrugBank is proportionally highest (51%), significantly exceeding TTD (7%). This is in accord with the historical focus on ligands by DrugBank but this includes a small proportion of false-positives. An example is Alpha-D-Mannose (DB02944), classified as an experimental drug and mapped to 76 targets. The mapping is via hetero-atom entries rather than authentic ligands, although these proteins are not flagged with pharmacological action.

11. Biosystems (9757). This NCBI resource maps compounds into pathways via protein target links in BioAssay records and pathway databases [17]. Because of the many BioAssay target links in ChEMBL, the mapping is available for 20% of records there. In DrugBank and TTD the percentage is much lower at 6.5% and 4.9%, respectively. In HMDB, this information is available for 17% of all records, which seems quite high, especially if factored by size relative to ChEMBL. However, by definition, many of the compounds are mapped into metabolic pathways.
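
The "flat" versus stereo-specified match described under Source-Unique (item 5) can be illustrated with a short Python sketch. It relies on the standard InChIKey layout, in which the first hyphen-separated block encodes the connectivity skeleton and a second block of UHFFFAOYSA indicates that no stereo or isotopic layers were hashed. The helper names are hypothetical.

```python
def same_skeleton(key_a, key_b):
    """True if two InChIKeys share the first (connectivity) block."""
    return key_a.split("-")[0] == key_b.split("-")[0]


def is_flat(inchikey):
    """True if the second block is the hash of empty stereo/isotope layers."""
    return inchikey.split("-")[1] == "UHFFFAOYSA"


stereo_key = "RAERAPYSCWYQAO-AKIFATBCSA-N"  # CID 56680063 (CHEMBL1797692)
flat_key = "RAERAPYSCWYQAO-UHFFFAOYSA-N"    # CID 71369165 (supplier compound)

print(same_skeleton(stereo_key, flat_key))      # True
print(is_flat(stereo_key), is_flat(flat_key))   # False True
```

Grouping CIDs by the first block in this way is one simple route to estimating how many single-source CIDs differ from other records only in stereo representation.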



4 Results for Proteins 

4.1 2010 vs. 2013 

We recorded protein content changes in all four databases. We improved the resolution of identifiers in 2013 by accessing UniProt IDs directly, rather than using the Protein Identifier Cross-Reference Service (PICR) to BLAST-map downloaded FASTA sequences as we had done in 2010 [18]. In addition, new DrugBank subsets have become
available and HMDB content is being updated at the time 
of writing (May 2013). None of these changes are problem- 
atic per se since curation processes and underlying data- 
base schema are being continuously enhanced. Conse- 
quently, maintaining retro-consistency is impractical. How- 
ever, this does make interpreting differences between old 
vs. new protein sets less valid than the analogous chemis- 
try comparisons. This section will therefore focus more on 
comparing between the 2013 protein sets rather than old- 
vs-new. Nevertheless, we can start with 2010 vs. 2013 
(Figure 5). 

The results fall into two groups that we can term replacement and expansion. Thus, ChEMBL and HMDB have predominantly expanded whereas the other two have undergone removals as well as additions. The major change for DrugBank is due to the recent option of being able to download just those proteins that are flagged "Pharmacological action: yes" (highlighted in green in the Target records). This approximates to what could be termed a primary target mapping for the 716 proteins. While we could have chosen to compare different download sets from the same year, this would have made the analysis overly complex. Where a new choice is available, our default is to take the set with highest mapping specificity, even if this confounds retrospective comparisons. In Figure 5 the TTD 2013 set has also decreased but some of this may be due to discrepancies between our 2010 PICR cross-mapping and the 2013 direct ID download.

Figure 5. Protein content changes 2010 vs. 2013. The totals of the protein content for the two years are on top of each circle in the Venn diagram.
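
The replacement versus expansion distinction can be sketched in Python as a comparison of two identifier sets. This is an illustration only: the function names, the toy identifiers and the 10% removal threshold are assumptions and are not criteria used in the study.

```python
def compare_releases(ids_old, ids_new):
    """Count identifiers retained, removed and added between two releases."""
    old, new = set(ids_old), set(ids_new)
    return {
        "retained": len(old & new),
        "removed": len(old - new),
        "added": len(new - old),
    }


def change_pattern(ids_old, ids_new, removal_threshold=0.1):
    """Label a change as 'replacement' or 'expansion'.

    The threshold is a hypothetical cut-off on the fraction of old
    identifiers that were removed.
    """
    old = set(ids_old)
    if not old:
        return "expansion"
    removed = compare_releases(ids_old, ids_new)["removed"]
    return "replacement" if removed / len(old) > removal_threshold else "expansion"


# Toy identifier lists (placeholders, not real UniProt accessions).
print(compare_releases(["ID1", "ID2", "ID3"], ["ID2", "ID3", "ID4", "ID5"]))
print(change_pattern(["ID1", "ID2"], ["ID1", "ID2", "ID3"]))
```

Note that identifier-mapping differences between years (such as PICR versus direct download) would appear here as spurious removals and additions.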

Before moving on to the current proteins we performed a three-way comparison between just the three drug discovery databases between 2010 and 2013, because HMDB is the "odd-man-out" in not sharing a similar target mapping concept (Figure 6). The comparison of database consensus sets in Figure 6 should be more robust than individual sets. The time point results have three features. The first is that target protein capture expanded by 60%. Secondly, when examined by the Panther classification [19], no significant shifts are detected (e.g. the receptor:enzyme ratio is similar despite expansion). The third feature is that all 32 proteins "lost" from the 2010 consensus are from DrugBank. The reason is that the smaller set of primary targets, selectable in 2013, has eliminated these 32 mappings. One of these, human serum albumin, provides particular classification challenges. In ChEMBL, P02768 is target-mapped to 650 compounds (as CHEMBL3253) with most of the 239 entries described as small-molecule binding assays because there is no "carrier" option. DrugBank has three unique entries for this protein as a biopharmaceutical, but also maps 104 small-molecules to P02768, classified as a "carrier" relationship; as expected, none have a "pharmacological action". A different classification is used by TTD in mapping five small-molecules (including the antacid bismuth) to albumin (TTDS00336), but it shows limited curatorial discrimination in assigning the default relationship of "successful target" (presumably because the five compounds were approved), in spite of the references clearly specifying the carrier function (except for the inclusion of one imaging agent). To complete the inter-database albumin survey, HMDB also maps the protein to nine small molecules, including several hormones (arguably also as a carrier), but classifies the protein type as "unknown" (HMDBP020759).

Figure 6. Comparison of the 2010 and 2013 target protein consensus intersects between ChEMBL, DrugBank, and TTD. The consensus totals are shown above the circles (392 for 2010 and 654 for 2013). For HMDB no target protein information is available.

4.2 2013 Protein Comparisons 

We initially addressed some basic questions. The first was 
to examine correlations between the statistics reported by 
the databases and our ID downloads. For ChEMBL the targets to UniProt ID ratio was 9570:5797 (because not all targets can be protein-mapped). For instance, whole-cell
screening is widely used to measure the growth inhibition 
effects for anti-infectives of all types in the primary litera- 
ture. Consequently, searching ChEMBL with "Plasmodium" 
gives 14 results and "mycobacterium" gives 38 results. Over 
20000 compounds are thus mapped to these organisms as 
species "targets", a relationship type not typically captured 
by the two drug databases. The latest DrugBank reported 
statistics are 4167 targets, 221 enzymes, 11 carriers (includ- 
ing albumin) and 120 transporters. While we recorded 4023 
UniProt IDs from the data extractor download we selected 
the new primary target subset of 715. Currently, there are 
no reported protein content statistics for HMDB that we 
can compare with our download results but the mappings 
are under revision. The subsets from TTD specify 2025 "targets", including 364 successful, 286 clinical trials, 44 discontinued and 1331 research. Our analysis found 1768 de-duplicated UniProt IDs. Thus, for TTD "target" also does not always equate to a UniProt ID. For example, TDC00233
specifies "Gamma secretase" as a clinical trial target for two 
compounds. However, there is evidence of mixed curatorial 
rules because while this target entry was mapped to a Uni- 
Prot ID, in TTDR00444 and TTDR00445 the (same) 11 com- 
pounds are mapped to the gamma secretase presenilin 
1 and 2 subunits, respectively (P49768, P49810). 

Two other important questions are the species split between human vs. non-human and the Swiss-Prot-to-TrEMBL ratio (SP:TR). Results from the UniProt query interface [20] are shown below (Table 6).

The details cannot be expanded here but some features can be interpreted. In the case of ChEMBL, as might be expected from the wide range of primary medicinal chemistry literature extracted, the zoo of targets includes 45% human and 15% rodent (698 mouse plus 199 rat). Some of the other 400 species seem counter-intuitive as drug targets, such as Q42656 from Coffea arabica (CHEMBL5217) with 94 compounds aligned against it. It turns out this is a mechanistic exemplar enzyme for cross-screening human alpha-mannosidase inhibitors as potential antitumor agents. DrugBank has a distinct human vs. anti-infectives split but is rodent-free because orthologous substitution has been their chosen curatorial practice. This means human protein IDs replace rodent or other mammalian proteins specified in the references for the drug entries [1]. Analogously, there are a number of Gram-positive antibacterial compounds where the E. coli orthologue has been added or substituted for the Staph. or Strep. target. Many species are also included in TTD but TTDR00218 (O23733), a cysteine synthase from Brassica juncea, is an error. The viral strain polymerases in HMDB will be removed during a current revision of protein mappings [11].

Table 6. Comparison of species and Swiss-Prot:TrEMBL ratios (SP:TR).

                      ChEMBL        DrugBank      HMDB         TTD
All                   5797          715           5647         1757
Human                 2621          576           5231         1277
Species/strain total  401           63            375          173
Largest non-human     Mouse (698)   E. coli (45)  Hep C (32)   E. coli (89)
SP:TR                 5222:575      691:24        5376:271     1620:137
Human TR-only         20            4             270          12

The topic of TrEMBL entries generated by automated an- 
notation is also too detailed to go into here. However, 
while the number of human TrEMBL entries is low (Table 6) 
their presence indicates probable curatorial errors and/or 
updating lapses. The reason is that the Swiss-Prot expert 
review process is essentially complete for the human can- 
onical proteome and certainly for all plausible drug targets. 
Thus, human mappings should have a Swiss-Prot ID rather 
than a TrEMBL (or both). To be fair, the current human SP:TR ratio of 20 255:113 824 makes assigning an incorrect (or quasi-duplicate) entry an easy curatorial error to make because of the 5-fold excess of accession numbers for proteins with partially shared automated annotation but related by splice forms, minor sequence differences or as fragments.
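
The SP:TR figures quoted above and in Table 6 can be checked with simple arithmetic. The Python sketch below takes the counts from Table 6 and from the human SP:TR ratio in the text; the variable and function names are illustrative only.

```python
# (Swiss-Prot, TrEMBL) counts per database, as listed in Table 6.
SP_TR = {
    "ChEMBL": (5222, 575),
    "DrugBank": (691, 24),
    "HMDB": (5376, 271),
    "TTD": (1620, 137),
}


def trembl_percent(sp, tr):
    """Percentage of a database's UniProt IDs that are TrEMBL entries."""
    return 100.0 * tr / (sp + tr)


for name, (sp, tr) in SP_TR.items():
    print(f"{name}: total {sp + tr}, TrEMBL {trembl_percent(sp, tr):.1f}%")

# Human TrEMBL excess over Swiss-Prot, from the 20 255:113 824 ratio.
human_excess = 113824 / 20255
print(f"Human TR per SP entry: {human_excess:.1f}")
```

The per-database totals computed this way agree with the "All" row of Table 6, and the human ratio comes out slightly above the 5-fold excess quoted in the text.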

The data underline the marked differences in species capture between the four databases. We chose human-only comparisons since they have a number of advantages compared to using total UniProt IDs. Firstly, this normalises the comparison (i.e. apples vs. apples). Secondly, we can compare the three drug-centric databases as a separate set and, thirdly, comparisons of protein function distribution are more valid for a single species. Only salient features can be highlighted, but note that Venny can be used to reproduce any of these via the supplementary data protein lists. Given the caveats mentioned for 2010 vs. 2013, three points are clear from Figure 7a and 7b. Total protein capture is high, the consensus is low and the four databases have, in general, further diverged. Taking human-only reduces the 2013 divergence (Figure 7c). Comparing just the drug databases for human-only increases the 3-way consensus (Figure 7d) to a probable approximation of approved drug targets but the union of the three encompasses over 15% of the genome.

Figure 7. Four-way Venn diagrams of UniProt ID content. These are shown as: a) 2010, b) 2013, c) human-only 2013, and d) human-only for the three drug databases in 2013. The ratio of intersect between the sources and the total union in each case is: a) 298:12512, b) 298:10036, c) 301:6275, and d) 351:3046.
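
For readers who prefer a script to a web tool, the intersect:union ratios of this kind can be reproduced from protein ID lists with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the toy sets simply group accessions mentioned elsewhere in this paper and do not represent real database content, and the Swiss-Prot human count of 20 255 is used as a proxy for genome size.

```python
def consensus_and_union(id_sets):
    """Return (consensus, union) for a dict of name -> iterable of IDs.

    Requires at least one entry in id_sets.
    """
    sets = [set(ids) for ids in id_sets.values()]
    return set.intersection(*sets), set.union(*sets)


# Toy grouping for illustration only.
toy = {
    "ChEMBL": {"P04035", "P08684", "P22411"},
    "DrugBank": {"P04035", "P27487", "P35869"},
    "TTD": {"P04035"},
}
consensus, union = consensus_and_union(toy)
print(f"{len(consensus)}:{len(union)}")  # prints 1:5

# Share of the genome covered by the 3-way human union (Figure 7d).
genome_share = 100.0 * 3046 / 20255
print(f"{genome_share:.1f}%")
```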

4.3 Protein Functional Categories 

Having normalised the databases to human protein IDs there are many options for property and annotation comparison that could provide insights into coverage selectivity. We have chosen the Gene Ontology (GO) molecular function because this is conveniently generated via the Panther web site [21]. This usefully provides a functional distribution for any set of protein IDs but with the caveat that GO terms are both nested (i.e. top categories are broad) and forked (some proteins are assigned to multiple functions). While this means the derived pie charts should not be over-interpreted, they are nonetheless useful for a broad-brush comparison of sources (Figure 8).

From left to right, both ChEMBL and HMDB have the highest proportions of enzymes. It is also clear that HMDB has a capture scope that extends beyond metabolic enzymes. DrugBank is similar to TTD but the latter has proportionally fewer ion channels and receptors. DrugBank looks most similar to the consensus proteins in the high proportion of receptors, enzymes and ion channels. This is unsurprising, since Figure 7d indicates 60% of the primary target proteins are subsumed into the 3-way consensus.

Figure 8. Gene Ontology molecular function distributions. These are shown as top-level categories for human proteins in: (a) ChEMBL, (b) DrugBank, (c) HMDB, (d) TTD, and (e) the three-way consensus of ChEMBL, DrugBank and TTD. The last panel has the colour key to the GO molecular function categories. Total proteins in each case are the central intersects from Figure 7c and 7d.



4.4 Approved Target Lists 

The target proteins of approved small-molecule drugs are 
of intense interest but listings that have appeared in the lit- 
erature have not typically been compared. We have used 
this work as an opportunity to make such comparisons by 
using three human-only sets. The first of these is the 3-way 
consensus between the 2013 versions of ChEMBL, DrugBank and TTD (i.e. the 351 proteins in the centre of Figure 7d). The second is a list of targets derived from a re-curation of DrugBank in 2011 (RAS set) [22]. The third is an extended list from a commercial database compiled in 2011, each of which had chemical modulation data in papers or patents (SOU set) [23].

Notably, we see concordance and discordance. While the 
common set is only 220 targets it represents a five-way 
consensus (although the RAS set used an earlier DrugBank 
version as the starting point the final list was independent- 
ly curated). Analogously, the overlap sets of size 53, 58 and 
76 represent a two-way consensus. The set with at least 
two intersects (407) would thus be a good approximation 
to a human primary target set (up to 2011). The extensive 
unique content from the SOU set is expected because this 
includes a "long tail" of research targets, 473 of which had 
a chemistry-to-protein relationship curated from only one 
document. We can utilise GO classification charts (Figure 10) to compare the protein lists from different sections of the Venn diagram (Figure 9).
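
The "at least two intersects" set can be derived by counting, for each protein, how many lists contain it. The Python sketch below is illustrative: the function name and toy identifiers are hypothetical, while the final arithmetic uses the segment sizes quoted above.

```python
from collections import Counter


def supported_by_at_least(id_sets, k=2):
    """IDs that occur in at least k of the supplied collections."""
    counts = Counter(pid for ids in id_sets for pid in set(ids))
    return {pid for pid, n in counts.items() if n >= k}


# Toy lists standing in for the SOU, RAS and 3-way sets.
sou = {"a", "b", "e"}
ras = {"b", "c"}
three_way = {"c", "a", "d"}
print(sorted(supported_by_at_least([sou, ras, three_way], k=2)))  # ['a', 'b', 'c']

# Segment sizes from the text: common core plus the three pairwise overlaps.
at_least_two = 220 + 53 + 58 + 76
print(at_least_two)  # 407
```

Raising k to the number of lists returns the common core, so the same helper covers both the consensus and the relaxed two-source criterion.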



Taking the 2-or-3-way set first, not unexpectedly, because it contains only 56 more proteins, this looks similar to the 351 protein set from the 3-way drug database consensus and DrugBank primary targets. Also not unexpected is that the large number of proteins in the SOU-only set shows the highest proportion of enzymes, because these constitute a substantial part of the long tail of research targets. Given that the other two unique sets are small, the distributions need to be interpreted with caution, but they do indicate differences. While the selectivity that might have caused this would need a detailed analysis, we can take one example. The dark purple sector in Figure 10b contains structural proteins (GO:0005198) that are not typically drug targets. It turns out that one of these unique to the 3-way set is Tubulin beta-2 chain (P68371). This duly has entries in the three databases as TTDS00389, CHEMBL1848 and DrugBank target 2499. However, the large number of tubulin protein components for different species has resulted in inconsistent curatorial choices for microtubule modulators reported in the literature. In the SOU set these may have been assigned to microtubule as a target designation (i.e. without a protein ID) and in the RAS set to Tubulin beta chain (P07437). The small molecule mappings for this mechanism of action are rendered even more complex because vincristine (CHEMBL303560) is mapped to three non-human tubulin proteins in ChEMBL, whereas in DrugBank, the anthelmintic albendazole (DB00518) is orthologously mapped to two human tubulins (Q71U36 and P68371). Because these are both classified as pharmacologically significant, these proteins are included in the DrugBank human targets for approved drugs.

Figure 9. Venn diagram of approved drug target sets. The SOU 1654 set is from PMID 21569515, RAS 437 from PMID 21804595 and the 3-Way, 351, is the ChEMBL, DrugBank, TTD intersect from Figure 7d.

Figure 10. Gene Ontology molecular function distributions for approved target sets. The panels are: (a) SOU-only, 1395; (b) 3-way-only, 20; (c) RAS-only, 82; (d) 2-or-3-way, 407.

Our final protein content comparison looked at map- 
pings associated with just one drug, atorvastatin (Lipitor), 
an inhibitor of 3-hydroxy-3-methylglutaryl-coenzyme A re- 
ductase, HMGCR (P04035). The Venn diagram (Figure 11) 
shows pronounced differences. It is important to note that 
these protein relationships are to the parent molecule (i.e. 
not salts) in each case (CHEMBL1487, DB01076, HMDB05006 and DAP000553). All four have one target (P04035) in common but no others. Notably, ChEMBL connects 112 proteins to atorvastatin in 2013 but only three in 2010. It turns out the majority of new mappings come from a DrugMatrix panel screen added in ChEMBL v.15. This includes 103 proteins with 1742 associated activity measurements (CHEMBL1909046). Compared to the other three databases, where the predominant relationship is selected as an activity modulation, panel screens can record the absence of activity (at the maximum concentration tested) across a substantial proportion of the matrix, but mappings are recorded in the database for every protein in the panel (i.e. the relationship is "has been assayed"). In SAR terms the inclusion of both activity and inactivity is important. However, users then need to apply filtration and threshold judgements when querying the data for their particular needs.

Figure 11. Venn diagram of protein identifiers linked to atorvastatin. Totals are indicated at the top of each ellipse (DrugBank: 3).

Beyond one other protein in common (P08684, Cyto- 
chrome P450, 3A4) the remaining unique assignments indi- 
cate marked differences in relationship capture between 
the databases. As can be seen in the Venn diagram, TTD 
has P04035 mapped to atorvastatin as single primary 
target. On the other hand, the DrugBank record, (DB01076) 
is mapped (in 2010 and 2013) to two additional targets, Di- 
peptidyl peptidase 4 (DPPIV, P27487) and the Aryl hydrocar- 
bon receptor (P35869). Only P04035 is captured in the 
2013 download because this is tagged as known pharmacology. It turns out that DPPIV is also linked to atorvastatin in ChEMBL but to pig not human (P22411). Inspection of the record shows this to be the correct link via PMID 18068977 but the data quoted in the abstract (i.e. a Ki of 58 µM) is not captured in the record. DrugBank cites the same reference but orthologously substitutes human for pig and tags the relationship as having no pharmacology. Given the high Ki this seems an appropriate curatorial judgment (this transitive assignment has been incorporated into the PubChem record for atorvastatin). However, further
mapping complexity is introduced via CHEMBL393220. This 
maps the atorvastatin calcium salt as a separate entity to 
a series of antimalarial whole-parasite screens plus rat 
HMGCR (P51639). For HMDB three of the 17 proteins have 
reaction schema related to HmgCoA metabolism. However, 
the reasons for mapping the other 14 proteins into this 
entry are unclear. 



5 Conclusions 

Many aspects of databases can be compared in order to 
discern utility. These include the data model, web interface, 
query and navigation functionality, cross-links, download 
sets, API availability, facility for integration, PubMed content 
and mapped relationship distributions (e.g. proteins-per- 
compound and compounds-per protein). Despite the 
impact of these on exploitation we have had to limit our- 
selves here to what we consider the two most important 
features of chemistry and protein content. The approaches 
outlined are generic and we thus encourage others to per- 
form such studies, especially where broader adoption of In- 
ChlKey and UniProtlDs make comparisons more straightfor- 
ward than hitherto. The databases in this study provided 
useful download options but differ in the extent of post- 
download processing necessary for standardised compari- 
sons. Thus, broader intra-database harmonisation would be 
welcome, for example, if each of the chemical and target 
sets could be downloaded as SD files, protein ID lists in 
Excel, UniProt cross-references, have an API (as ChEMBL has 
already) as well as complete and selectable PubChem sets. 

Our chemistry results show that all the sources have improved. This is likely to be a combination of enhanced structure handling rules and manual curation. While this has increased overlap there is also divergence and expansion. This enhances their complementarity and aggregate coverage. Nevertheless, questions as to "how they became the way they are" are important for exploitation. In this respect the ChEMBL publications, regular release notes, and explanations of their medicinal chemistry journal triage go a long way to answering these questions. While the other databases are also well described in their publications, analysis is necessary to discern selectivity that is not made explicit. Examples include the fact that DrugBank is still over 50% PDB derived, TTD has used ChEMBL for their expansion, HMDB contains many drugs and that lipids are now a major proportion of HMDB content.

A limitation of the current study is the restriction to 
exact matches between the chemistry collections. Apparent 
increases in divergence by this parameter may not necessa- 
rily translate into wider chemical coverage. In this respect, 
the impact of these databases is not only on accessing 
data, but using it to develop models. Thus, updates will 
have relatively little impact on QSAR and similarity models 
if similarity remains high. Further work would be needed to 
see if distribution of pairwise similarities (2D and/or 3D) 
had shifted significantly between 2010 and 2013. 
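
One way such a follow-up could be framed is sketched below in Python: compute the Tanimoto coefficient for every pair of 2D fingerprints within a release and compare summary statistics between releases. This is not a method used in this study. Fingerprints are represented here simply as sets of "on" bit positions, and the toy values are hypothetical; in practice they would come from a cheminformatics toolkit.

```python
from itertools import combinations
from statistics import mean


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarities(fingerprints):
    """Tanimoto values for every unordered pair in one collection."""
    return [tanimoto(x, y) for x, y in combinations(fingerprints, 2)]


# Toy fingerprints for two hypothetical releases of the same database.
release_old = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
release_new = release_old + [{1, 2, 3, 4}, {9, 10, 11}]

print(mean(pairwise_similarities(release_old)))
print(mean(pairwise_similarities(release_new)))
```

Because the number of pairs grows quadratically with collection size, a real comparison on databases of this scale would probably need sampling or a nearest-neighbour summary rather than the full pairwise matrix.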

The aggregate protein content for the four databases in 
2013 shows a major expansion encompassing not only 
binding events in the thermodynamic sense but also bio- 
chemical and pharmacological activity. However, the high 
numbers we have recorded, along with some of the indi- 
vidual examples, indicate divergent curatorial rules and 
stringencies for each source. For example, the three drug-centric databases (Figure 7d) cover 3046 human proteins (i.e. ~15% of the genome). This contrasts with the active compound mapping from an extensive corpus of literature and patents that recorded relationships to 1654 proteins in 2011, although a subsequent report added internal company data to support up to 2000 mappings [23,24]. The 5232 proteins in HMDB also provide a maximal-mapping example, since a different approach recorded only 1653 human metabolic enzymes [25]. There are arguments for extending mappings to indirect and assayed relationships, as opposed to a more stringent restriction to only potent activity modulation or direct metabolic interaction. Nevertheless, extended mappings can be problematic. For example, they have the potential to confound Linked data integration if the source-specific filtration options are submerged [26]. A second concern is proliferation beyond the sources they
were first incorporated in. This can occur not only via auto- 
mated cross-linking but also by curatorial transfer between 
databases. 